Data Analysis & Visualization - Michael Cheng¶
Project Problem Statement - AllLife Bank Customer Segmentation¶
Background¶
Context
AllLife Bank wants to focus on its credit card customer base in the next financial year. They have been advised by their marketing research team that the penetration in the market can be improved. Based on this input, the marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers.
Another insight from the market research was that the customers perceive the support services of the bank poorly. Based on this, the operations team wants to upgrade the service delivery model, to ensure that customers' queries are resolved faster. The head of marketing and the head of delivery, both decide to reach out to the Data Science team for help.
Objective
Identify different segments in the existing customer base, taking into account their spending patterns as well as past interactions with the bank.
Data Description: Data is available on customers of the bank with their credit limit, the total number of credit cards the customer has, and different channels through which the customer has contacted the bank for any queries. These different channels include visiting the bank, online, and through a call center.
Sl_no - Customer Serial Number
Customer Key - Customer identification
Avg_Credit_Limit - Average credit limit (currency is not specified, you can make an assumption around this)
Total_Credit_Cards - Total number of credit cards
Total_visits_bank - Total bank visits
Total_visits_online - Total online visits
Total_calls_made - Total calls made
Import Libraries & Load Data¶
import plotly.io as pio
pio.renderers.default = 'notebook'
from IPython.display import HTML
HTML('''<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>''')
import pandas as pd
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Summary Tools
from summarytools import dfSummary
data2 = pd.read_excel("/mnt/e/mikecbos_E/Downloads/MIT_Elective-AllLife/Credit+Card+Customer+Data.xlsx")
Data Preprocessing¶
# Copy of data
df2 = data2.copy()
# Overview of data
print(df2.head())
df2.info()
dfSummary(df2)
Preliminary Observations
The dataset contains 660 rows and 7 columns with no missing values; all values are integers representing customer data (credit and bank interactions)
The features align naturally with the following categories: CustomerID, CreditProfile, and BankInteraction
a. Customer ID: Sl_No and Customer Key
b. Credit Profile: Avg_Credit_Limit and Total_Credit_Cards
c. Bank Interaction: Total_visits_bank, Total_visits_online, and Total_calls_made
Customer Serial Number (Sl_No) has 660 distinct records whereas Customer Identification (Customer Key) has 655 distinct records; the duplicates need to be reviewed and verified
Statistically:
a. Avg_Credit_Limit has the highest coefficient of variation (CV), indicating substantial heterogeneity
b. As a whole, the BankInteraction metrics have moderate variability, with CVs roughly between 1.0 and 1.5
Total_visits_bank has a limited range (0 to 5); this implies customers' interaction is less reliant on the traditional brick-and-mortar approach to banking
Total_visits_online has a wide range (0 to 15) with high variability (standard deviation 2.9 against a mean of 2.6) compared to physical visits, confirming customers' reliance on virtual over physical banking interactions; this contrasts with the other BankInteraction metrics and will benefit from deeper exploration
Total_calls_made has relatively consistent variance (standard deviation 2.9 with a mean of 3.6) and a long right tail; this tail forms a group of outliers, a subset of customers who make significantly more calls than the majority, and will benefit from deeper exploration
c. Total_Credit_Cards shows low variance (standard deviation of 2.2), suggesting a stable distribution across the population
d. Long tails are evident in Avg_Credit_Limit, Total_visits_online, and Total_calls_made, and will benefit from deeper exploration of their respective outliers
The CreditProfile category may represent a "low hanging fruit" investigation opportunity to discover potential hidden relationships, due to the high variability of Avg_Credit_Limit juxtaposed with the low variability of Total_Credit_Cards
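The long-tailed features flagged above can be screened quickly with the standard 1.5×IQR (Tukey fence) rule. A minimal sketch, illustrated on a small hypothetical array rather than df2 so it runs stand-alone:

```python
import numpy as np

def iqr_outlier_count(values):
    """Count points outside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Toy long-tailed sample standing in for, e.g., df2['Avg_Credit_Limit']
sample = np.array([1, 2, 3, 4, 5, 100])
print(iqr_outlier_count(sample))  # 1 -> only the extreme value 100
```

In practice each of the three long-tailed columns would be passed in turn to size its outlier subset before deciding how to treat it.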
Decision Point¶
- CustomerID fields are categorical, and can risk introducing noise to the analysis and clustering
- Create Customer_ID by concatenating Customer Key with Sl_No to distinguish between records (duplicates may arise from historical transactions, shared access within a household, accounts serving different purposes for the same customer, etc.)
- Sl_No and Customer Key can then be dropped
- The new Customer_ID can then be indexed as necessary in subsequent studies
CustomerID¶
# Inspect duplicates
duplicate_keys = df2[df2['Customer Key'].duplicated(keep=False)]
duplicate_keys.groupby('Customer Key').size().reset_index(name='Frequency')
duplicate_keys.sort_values(by='Customer Key')
# Create the Customer_ID by concatenating Customer Key and Sl_No
df2['Customer_ID'] = df2['Customer Key'].astype(str) + "_" + df2['Sl_No'].astype(str)
# Review the updated DataFrame
print(df2[['Customer Key', 'Sl_No', 'Customer_ID']].head(20))
# drop original CustomerID fields
df2 = df2.drop(['Customer Key', 'Sl_No'], axis = 1)
df2
df2.info()
CreditProfile¶
# Preliminary bivariate analysis
import matplotlib.pyplot as plt
import seaborn as sns
# Create a figure with two subplots: one for scatter plot and one for box plot
fig, ax = plt.subplots(2, 1, figsize=(10, 12), sharex=True, gridspec_kw={'height_ratios': [1, 3]})
# Scatter Plot
sns.scatterplot(
data=df2,
x='Total_Credit_Cards',
y='Avg_Credit_Limit',
ax=ax[0],
alpha=0.7
)
ax[0].set_title('Scatter Plot of Avg_Credit_Limit vs Total_Credit_Cards')
ax[0].set_ylabel('Avg_Credit_Limit')
ax[0].grid(visible=True)
# Box Plot
sns.boxplot(
data=df2,
x='Total_Credit_Cards',
y='Avg_Credit_Limit',
ax=ax[1]
)
ax[1].set_title('Box Plot of Avg_Credit_Limit Across Total_Credit_Cards')
ax[1].set_xlabel('Total_Credit_Cards')
ax[1].set_ylabel('Avg_Credit_Limit')
ax[1].grid(visible=True)
# Adjust layout
plt.tight_layout()
plt.show()
Observations¶
- Consistent with intuition: The more credit cards a customer has, the higher their credit limit
- A few outliers exist in the lower credit-card groups, though they are less frequent than in the higher groups. These outliers seem meaningful for further analysis; K-Medoids, whose cluster centers must be actual data points, can incorporate them without letting them distort the cluster centers
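To illustrate why a medoid-based method handles such outliers gracefully, compare a centroid (mean) with a medoid on a toy 1-D array (hypothetical values, not taken from the dataset): the medoid must be an observed point, so a single extreme value cannot drag the cluster center away from the bulk of the data.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 100.0])  # one extreme outlier

centroid = values.mean()  # pulled toward the outlier
# Medoid: the observed point minimizing total distance to all other points
dist_sums = np.abs(values[:, None] - values[None, :]).sum(axis=1)
medoid = values[np.argmin(dist_sums)]

print(f"centroid = {centroid}, medoid = {medoid}")  # centroid = 26.5, medoid = 2.0
```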
# Kernel Density Estimation: Evaluate the high variability of Avg_Credit_Limit vs low variability of Total_Credit_Cards
# Coefficient of Variation
cv_credit_limit = (df2['Avg_Credit_Limit'].std() / df2['Avg_Credit_Limit'].mean()) * 100
cv_credit_cards = (df2['Total_Credit_Cards'].std() / df2['Total_Credit_Cards'].mean()) * 100
print(f"Raw Scores - Coefficient of Variation:")
print(f"CV of Avg_Credit_Limit: {cv_credit_limit:.2f}%")
print(f"CV of Total_Credit_Cards: {cv_credit_cards:.2f}%")
import numpy as np
# Log transformation for Avg_Credit_Limit to adjust scale
df2['Log_Avg_Credit_Limit'] = np.log1p(df2['Avg_Credit_Limit']) # Use log(1 + x) to handle zero values if present
# Overlayed Density Plot
plt.figure(figsize=(10, 6))
sns.kdeplot(df2['Log_Avg_Credit_Limit'], label='Log(Avg_Credit_Limit)', fill=True, color='blue', alpha=0.7)
sns.kdeplot(df2['Total_Credit_Cards'], label='Total_Credit_Cards', fill=True, color='orange', alpha=0.7)
# Plot Titles and Labels
plt.title('Overlayed Distributions of Credit Limit and Total Credit Cards', fontsize=14)
plt.xlabel('Distribution', fontsize=12)
plt.xticks([])
plt.yticks([])
plt.text(0.5, 0.95, "Note: Each region is independent and proportional to its own scale.",
fontsize=8,
color="gray",
ha="center",
va="center",
transform=plt.gca().transAxes
)
plt.text(0.01, -0.05, "Less <---",transform=plt.gca().transAxes, fontsize=12, ha='left', va='center')
plt.text(0.99, -0.05, '---> More', transform=plt.gca().transAxes, fontsize=12, ha='right', va='center')
plt.ylabel('Concentration of Customers', fontsize=12)
plt.legend(fontsize=10)
plt.grid(visible=True)
# Display the plot
plt.tight_layout()
plt.show()
Observations_KDE¶
- Log transformation was needed due to the scaling differences between the two features
- Customers in the orange region have low credit cards and low credit limit (mostly being outside of the blue region)
- These customers may have limited banking engagement and/or fewer financial resources
- These customers may also have banking relationships elsewhere
- Marketing to these customers may be more elusive, since it may entail a long-term endeavor with few "quick wins", thus more incidental than intentional in nature
- Customers in the blue region have high credit limit
- Due to the overlap of orange within this blue region, there is ambiguity as to whether they have many or few credit cards (graph is not to scale)
- The ambiguity from the two overlapping regions will therefore need further bivariate analysis
- An AUC comparison normalized for absolute scale, along with KMeans customer-level analysis and segment-specific insights, can provide a more methodical approach to predictive analysis and intentional marketing for these customers than for those in the orange region
# Further Bivariate Analysis: AUC Comparison
from sklearn.preprocessing import StandardScaler
from scipy.stats import gaussian_kde
from scipy.integrate import quad
# Step 1: Standardize both variables
scaler = StandardScaler()
df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']] = scaler.fit_transform(
df2[['Avg_Credit_Limit', 'Total_Credit_Cards']]
)
# Step 2: Define KDEs for standardized data
kde_credit_limit = gaussian_kde(df2['Standardized_Credit_Limit'])
kde_credit_cards = gaussian_kde(df2['Standardized_Credit_Cards'])
# Common X range
x_min = min(df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']].min())
x_max = max(df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']].max())
x_range = np.linspace(x_min, x_max, 1000)
y_credit_limit = kde_credit_limit(x_range)
y_credit_cards = kde_credit_cards(x_range)
# Step 3: Calculate Overlap
def overlap_area(x):
return min(kde_credit_limit(x), kde_credit_cards(x))
overlap_auc, _ = quad(overlap_area, x_min, x_max)
# Total AUCs
total_auc_credit_limit = quad(lambda x: kde_credit_limit(x), x_min, x_max)[0]
total_auc_credit_cards = quad(lambda x: kde_credit_cards(x), x_min, x_max)[0]
# Normalize overlap
overlap_ratio_credit_limit = overlap_auc / total_auc_credit_limit
overlap_ratio_credit_cards = overlap_auc / total_auc_credit_cards
# Visualization
plt.figure(figsize=(10, 6))
plt.plot(x_range, y_credit_limit, label='Standardized Avg_Credit_Limit (KDE)', color='blue')
plt.plot(x_range, y_credit_cards, label='Standardized Total_Credit_Cards (KDE)', color='orange')
plt.fill_between(
x_range,
np.minimum(y_credit_limit, y_credit_cards),
color='purple',
alpha=0.5,
label='Overlap Region'
)
plt.title('Overlapping AUC Between Standardized Avg_Credit_Limit and Total_Credit_Cards')
plt.xlabel('Standardized Value Range')
plt.ylabel('Density')
plt.legend()
plt.grid()
plt.show()
# Print Results
print(f"Overlap AUC: {overlap_auc:.4f}")
print(f"Total AUC (Standardized Avg_Credit_Limit): {total_auc_credit_limit:.4f}")
print(f"Total AUC (Standardized Total_Credit_Cards): {total_auc_credit_cards:.4f}")
print(f"Overlap as % of Standardized Avg_Credit_Limit AUC: {overlap_ratio_credit_limit:.2%}")
print(f"Overlap as % of Standardized Total_Credit_Cards AUC: {overlap_ratio_credit_cards:.2%}")
Observations¶
The sliver of overlap seen earlier actually accounts for about 68% of each distribution's area, while the non-overlapping blue and orange regions combined account for roughly 32%. This high absolute overlap suggests that a significant proportion of the distributions align: the ranges of credit limits and credit-card counts are shared for many customers. The total AUC scores above confirm the kernel density estimates are well-defined and appropriately scaled.
The purple overlap region accounts for:
- 68.08% of the Avg_Credit_Limit AUC
- 71.43% of the Total_Credit_Cards AUC
These percentages suggest a meaningful overlap between the two variables.
The remaining ~30% of non-overlapping area may represent unique customer groups that can be further reviewed.
# Credit Profile/Upselling: EDA
from sklearn.cluster import KMeans
import warnings
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning) # Suppress UserWarnings
# Extract normalized data
normalized_data = df2[['Standardized_Credit_Limit', 'Standardized_Credit_Cards']]
# Apply KMeans Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df2['Cluster'] = kmeans.fit_predict(normalized_data)
# Visualize Clusters
plt.figure(figsize=(10, 6))
plt.scatter(
df2['Standardized_Credit_Limit'],
df2['Standardized_Credit_Cards'],
c=df2['Cluster'],
cmap='viridis',
alpha=0.6
)
plt.title('Customer Segments Based on Credit Limit and Credit Cards')
plt.xlabel('Standardized Credit Limit')
plt.ylabel('Standardized Total Credit Cards')
plt.colorbar(label='Cluster')
plt.grid()
plt.show()
EDA_Observations_on_Credit_Profile¶
The clusters from this preliminary Credit Profile EDA plot help reveal distinct customer segments based on standardized credit limit and credit-card counts. Along with the earlier EDA on CustomerID, Credit Profile comprises non-volitional factors that indirectly drive costs, revenue, and/or service dissatisfaction. Engaging with these factors is more incidental (i.e., being ready for when the prospect/customer is willing and able to buy). Interpretations will follow after further PCA and ensemble clustering analyses.
DecisionPoint_Upsell¶
- While the bank is looking to upsell to its existing customers [2], the dataset provides a very limited view on how to directly contribute to any upselling efforts. A potential proxy for upselling can be Total_Credit_Cards if we assume that holding more credit cards will correlate with higher customer value (i.e. higher revenue, loyalty, or engagement)
- However, even with using Total_Credit_Cards as a proxy, too many cards can lead to diminished returns (i.e. credit risk for customers, high level of service for the bank, etc.)
- Given the bank's focus on credit cards, however, "Upselling" here will pertain to credit cards and loan products, since inferences can be drawn from the relationship between Avg_Credit_Limit and Total_Credit_Cards
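One quick way to check that the Avg_Credit_Limit / Total_Credit_Cards relationship supports this proxy is a rank correlation, which tolerates the heavy right skew in Avg_Credit_Limit better than Pearson's r. A sketch on hypothetical toy values (in practice the two df2 columns would be passed instead):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical stand-in for df2: a monotone but non-linear relationship
toy = pd.DataFrame({
    "Total_Credit_Cards": [1, 2, 4, 6, 8, 10],
    "Avg_Credit_Limit": [5_000, 8_000, 20_000, 60_000, 90_000, 200_000],
})
rho, p_value = spearmanr(toy["Total_Credit_Cards"], toy["Avg_Credit_Limit"])
print(f"Spearman rho = {rho:.2f}")  # 1.00 -> perfectly monotone toy data
```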
PCA and Ensemble Clustering¶
Banking_Interaction_(bank_visits,_online_visits,_and_calls_made)¶
- Banking Interaction does not represent a goal in this study as there are many ways banking interactions can be a cost to the business, while at the same time, presenting revenue opportunities (thus involving confounding factors)
- These variables represent volitional factors that influence costs, revenue, and/or service dissatisfaction
- The limited dataset presents Total_Credit_Cards and Credit_Limit as proxies for upselling [3]
- Banking Interaction, therefore, is meaningful as it pertains to upselling opportunities, specifically for Credit Card and Loan Products
PCA and Clustering Model Analysis to evaluate:¶
UpsellingOpportunities¶
Upselling: PCA¶
# Preprocess data, use PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
features = ['Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made', 'Avg_Credit_Limit']
scaler = StandardScaler()
normalized_data = scaler.fit_transform(df2[features])
pca = PCA()
pca_data = pca.fit_transform(normalized_data)
# Plot explained variance ratio
import matplotlib.pyplot as plt
plt.plot(range(1, len(features) + 1), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
# Review Principal Components: Access PCA loadings
loadings = pd.DataFrame(
pca.components_,
columns=features, # Original feature names
index=[f'PC{i+1}' for i in range(len(features))] # Label components
)
print(loadings)
# Visualize Principal Components impact
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(loadings, annot=True, cmap='coolwarm')
plt.title('Upselling PCA Loadings Heatmap')
plt.xticks(fontsize=8, rotation = 45)
plt.show()
Finding¶
Although the PCA cumulative explained variance plot suggests 2 components as an inflection point, the heat map of loadings reveals that dropping to 2 components would result in the loss of important feature contributions, particularly from PCs 3, 4, and 5, which capture nuanced and actionable patterns in the data.
Decision Point¶
Based on the loadings and their contributions across all 5 principal components (PCs), including all 5 components appears to be meaningful, especially for capturing nuanced behaviors and contrasts in the data. This approach will capture detailed behavioral patterns (e.g., identifying low-credit, digitally active customers in PC5, etc.), and/or diversity within the customer base. With the limited dataset, the inclusion of all 5 components will not overly complicate subsequent clustering analysis.
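If a variance threshold were ever preferred over keeping all 5 components, the cut-off can be read programmatically from the cumulative explained-variance curve. A sketch on synthetic data (the real pipeline would reuse the fitted pca object above instead of the hypothetical X here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 3] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)  # inject correlation

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest k whose leading components explain at least 90% of the variance
k_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"components for 90% variance: {k_90}")
```

Equivalently, `PCA(n_components=0.90)` lets scikit-learn pick k automatically.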
Upselling: Clustering Analysis¶
# KMeans / GMM / KMedoids
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric
import matplotlib.pyplot as plt
import warnings
import pandas as pd
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Suppress FutureWarnings
warnings.filterwarnings("ignore", category=UserWarning) # Suppress UserWarnings
# ===== Step 1: Determine Optimal Number of Clusters (Elbow Plot) =====
# Calculate WCSS for different numbers of clusters
wcss = []
for k in range(1, 10):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(pca_data)
wcss.append(kmeans.inertia_)
# Plot Elbow Curve
plt.figure(figsize=(5, 3))
plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal Clusters')
plt.show()
# Optimal number of clusters (can be adjusted based on the Elbow Plot)
optimal_k = 3
# ===== Step 2: Add PCA Features to DataFrame =====
# Define feature column names
features = [f"PC{i+1}" for i in range(pca_data.shape[1])] # Assuming PCA was used
# Create a DataFrame from PCA data if it isn't already part of df2
for i, feature in enumerate(features):
df2[feature] = pca_data[:, i]
# ===== Step 3: Apply Clustering Methods =====
# ---- KMeans Clustering ----
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans_clusters = kmeans.fit_predict(pca_data)
df2['KMeans_Cluster'] = kmeans_clusters
# ---- Gaussian Mixture Model (GMM) ----
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
gmm_clusters = gmm.fit_predict(pca_data)
df2['GMM_Cluster'] = gmm_clusters
# ---- K-Medoids Clustering ----
# Define initial medoids (indices based on domain knowledge or random)
initial_medoids = [0, 50, 100] # Example indices for 3 clusters
kmedoids_instance = kmedoids(
pca_data, initial_medoids, metric=distance_metric(type_metric.EUCLIDEAN)
)
kmedoids_instance.process()
# Extract K-Medoids cluster assignments
kmedoids_clusters = kmedoids_instance.get_clusters()
df2['KMedoids_Cluster'] = -1
for cluster_id, indices in enumerate(kmedoids_clusters):
df2.loc[indices, 'KMedoids_Cluster'] = cluster_id
# ===== Step 4: Analyze Clusters =====
# Function to analyze cluster profiles
def analyze_clusters(df, cluster_column, feature_columns):
cluster_profiles = df.groupby(cluster_column)[feature_columns].mean()
cluster_profiles.index = cluster_profiles.index + 1 # Make clusters 1-based index
return cluster_profiles
# Generate cluster profiles for each method
print("KMeans Cluster Profiles (averages):")
print(analyze_clusters(df2, 'KMeans_Cluster', features))
print("\nGMM Cluster Profiles (averages):")
print(analyze_clusters(df2, 'GMM_Cluster', features))
print("\nKMedoids Cluster Profiles (averages):")
print(analyze_clusters(df2, 'KMedoids_Cluster', features))
Observations¶
- The Elbow Plot identifies 3 clusters as optimal
- KMeans and KMedoids results are very consistent, both being center-based partitional methods, particularly for Clusters 1 and 3
- GMM, being more sensitive to data distribution, captures different clustering behaviors
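The claimed KMeans/KMedoids consistency can be quantified with the Adjusted Rand Index, which scores partition agreement independently of how the cluster ids are numbered (1.0 = identical partitions, ~0 = chance). A sketch with hypothetical label vectors standing in for df2['KMeans_Cluster'] and df2['KMedoids_Cluster']:

```python
from sklearn.metrics import adjusted_rand_score

# Same partition, but label ids permuted between the two models
kmeans_labels = [0, 0, 1, 1, 2, 2, 2]
kmedoids_labels = [1, 1, 0, 0, 2, 2, 2]

ari = adjusted_rand_score(kmeans_labels, kmedoids_labels)
print(f"ARI = {ari:.2f}")  # 1.00 -> identical groupings despite relabeling
```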
Upselling: Ensemble Analysis¶
# 3-Model Ensemble
from sklearn.metrics import silhouette_score
import numpy as np
from scipy.stats import mode
# Assume the previous models have populated the following columns in df2:
# 'KMeans_Cluster', 'GMM_Cluster', 'KMedoids_Cluster'
# Extract cluster labels from the three models
kmeans_labels = df2['KMeans_Cluster'].to_numpy()
gmm_labels = df2['GMM_Cluster'].to_numpy()
kmedoids_labels = df2['KMedoids_Cluster'].to_numpy()
# Combine the cluster labels into a single array
all_labels = np.array([kmeans_labels, gmm_labels, kmedoids_labels])
# Perform majority voting to generate ensemble cluster assignments
ensemble_labels = mode(all_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster'] = ensemble_labels
# Calculate silhouette scores for all models, including the ensemble
kmeans_silhouette = silhouette_score(pca_data, kmeans_labels)
gmm_silhouette = silhouette_score(pca_data, gmm_labels)
kmedoids_silhouette = silhouette_score(pca_data, kmedoids_labels)
ensemble_silhouette = silhouette_score(pca_data, ensemble_labels)
# Print silhouette scores for comparison
print("Silhouette Scores for Clustering Models:")
print(f"KMeans Silhouette Score: {kmeans_silhouette:.4f}")
print(f"GMM Silhouette Score: {gmm_silhouette:.4f}")
print(f"KMedoids Silhouette Score: {kmedoids_silhouette:.4f}")
print(f"3-Model Ensemble Silhouette Score: {ensemble_silhouette:.4f}")
Findings¶
Based on the above silhouette scores, GMM's distinctive clustering is not the best fit here. KMeans and KMedoids are the best performing models. An ensemble combining these 2 models will be evaluated next.
# KMeans + KMedoids 2-model Ensemble
# Combine KMeans and KMedoids labels
refined_labels = np.array([kmeans_labels, kmedoids_labels])
ensemble_labels = mode(refined_labels, axis=0)[0].flatten()
df2['Ensemble_Cluster'] = ensemble_labels
# Calculate silhouette score for the refined ensemble
ensemble_silhouette = silhouette_score(pca_data, ensemble_labels)
# Print silhouette scores for comparison
print("Silhouette Scores for Revised Ensemble Clustering Models:")
print(f"KMeans Silhouette Score: {kmeans_silhouette:.4f}")
print(f"KMedoids Silhouette Score: {kmedoids_silhouette:.4f}")
print(f"Refined (2-Model) Ensemble Silhouette Score: {ensemble_silhouette:.4f}")
Findings¶
The revised ensemble with only KMeans and KMedoids is still not as robust as the individual models. Thus, KMeans and KMedoids will both be retained for their complementary strengths.
# Visualize KMeans and KMedoids
import plotly.express as px
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42) # Adjust perplexity as needed
tsne_results = tsne.fit_transform(pca_data) # Use PCA-reduced data or normalized original data
# Ensure original index is retained
tsne_df2 = pd.DataFrame(tsne_results, columns=['TSNE1', 'TSNE2'], index=df2.index)
# Add cluster labels and original fields
tsne_df2['KMeans_Cluster'] = df2['KMeans_Cluster']
#tsne_df2['GMM_Cluster'] = df2['GMM_Cluster']
tsne_df2['KMedoids_Cluster'] = df2['KMedoids_Cluster']
# Specific fields from df2 for hover information
fields_to_include = ['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online']
tsne_df2 = tsne_df2.join(df2[fields_to_include])
# Visualize with Plotly for K-Means
fig_kmeans = px.scatter(
tsne_df2, x='TSNE1', y='TSNE2', color='KMeans_Cluster',
hover_data=fields_to_include, # Add fields for hover information
title='t-SNE Visualization with K-Means Clusters',
color_continuous_scale='Viridis',
)
fig_kmeans.update_xaxes(showticklabels=False) # Hide x-axis tick labels
fig_kmeans.update_yaxes(showticklabels=False) # Hide y-axis tick labels
fig_kmeans.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmeans.show()
# Visualize with Plotly for GMM
#fig_gmm = px.scatter(
#tsne_df2, x='TSNE1', y='TSNE2', color='GMM_Cluster',
#hover_data=fields_to_include,
#title='t-SNE Visualization with GMM Clusters',
#color_continuous_scale='Viridis'
#)
#fig_gmm.update_xaxes(showticklabels=False) # Hide x-axis tick labels
#fig_gmm.update_yaxes(showticklabels=False) # Hide y-axis tick labels
#fig_gmm.show()
# Visualize with Plotly for K-Medoids
fig_kmedoids = px.scatter(
tsne_df2, x='TSNE1', y='TSNE2', color='KMedoids_Cluster',
hover_data=fields_to_include,
title='t-SNE Visualization with K-Medoids Clusters',
color_continuous_scale='Viridis'
)
fig_kmedoids.update_xaxes(showticklabels=False) # Hide x-axis tick labels
fig_kmedoids.update_yaxes(showticklabels=False) # Hide y-axis tick labels
fig_kmedoids.update_layout(autosize=False, width=800, height=350, coloraxis_showscale=False)
fig_kmedoids.show()